The compression index is an effective measure for assessing the primary
consolidation settlement of clayey soils, but obtaining it is
time-consuming and laborious. Thus, in the present study, we developed
two classical tree-based models, random forest (RF) and extreme gradient
boosting (XGBoost), to predict the compression index of clayey soils. To
establish these two models, we collected an available dataset of 391
consolidation tests for soils from previously published research. The
dataset consists of six physical parameters: the initial void ratio,
natural water content, liquid limit, plastic index, specific gravity,
and soil compression index. The first five parameters are the models’
inputs, while the compression index is the models’ output. We trained
both tree-based models using 90% of the entire dataset and used the
remaining 10% to assess the trained models, which is consistent with
the published research. Several statistical metrics, namely the
coefficient of determination (R²), root mean squared error (RMSE), mean
absolute error (MAE), and mean absolute percentage error (MAPE), are
the criteria for assessing the models’ performance. The results show
that the RF model predicts the compression index more accurately than
the XGBoost model, yielding lower RMSE, MAPE, and MAE on both the
training and testing datasets. The RF model achieves an R² of 0.928 and
0.818, RMSE of 0.016 and 0.025, MAPE of 7.046% and 10.082%, and MAE of
0.012 and 0.020 on the training and testing datasets, respectively. The
sensitivity analysis reveals that the initial void ratio has a
significant impact on the compression index of clayey soils.
1. Introduction 


The compression index of clayey soils is a measure of the soil's ability to compress or 
consolidate under an applied load [1]. It is a crucial property for engineers to consider when 
designing foundations, as it can affect the stability and settlement of the structure. The 
compression index of clayey soils is typically determined by conducting a series of oedometer 
laboratory tests on soil samples [2,3]. These tests involve applying increasing levels of stress to 
the soil and measuring the resulting consolidation or compression. The compression index is then 
calculated as the slope of the virgin compression portion of the curve of void ratio versus the 
logarithm of effective stress [4]. In 
general, clayey soils with a high compression index will be more prone to settlement and 
instability under load, while those with a low compression index will be more stable [5,6]. 
Engineers must consider the compression index of the soil when designing foundations to ensure 
that the structure is stable and will not experience excessive settlement. 
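
As a hedged illustration of how the compression index follows from an oedometer test, the sketch below assumes the standard definition Cc = -Δe / Δlog10(σ'), i.e., the slope of the virgin compression line in the plane of void ratio versus the logarithm of effective stress. The function name and the stress and void ratio readings are hypothetical and do not come from the dataset used in this study.

```python
import math

def compression_index(sigma_1, e_1, sigma_2, e_2):
    """Slope of the e-log10(sigma') line between two points on the
    virgin compression curve (sigma_2 > sigma_1, e_2 < e_1)."""
    return (e_1 - e_2) / (math.log10(sigma_2) - math.log10(sigma_1))

# Hypothetical readings: the void ratio drops from 0.80 to 0.74 when the
# effective stress increases tenfold, from 200 kPa to 2000 kPa.
cc = compression_index(200.0, 0.80, 2000.0, 0.74)
print(round(cc, 3))
```

With these made-up readings the result lies near the lower end of the compression index range reported for the dataset in Section 2.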


Since oedometer tests are time-consuming, costly, and cumbersome, researchers have proposed 
empirical formulas to predict the compression index [7-10]. However, most empirical formulas 
are calibrated to the conditions of a particular site, so their generality is limited. An 
empirical formula may not account for variations in soil properties and conditions that can affect 
the compression index. Additionally, empirical models are based on a limited amount of data and 
may not be accurate for all types of clayey soils [11-13]. 


Encouragingly, with the rapid development of soft computing techniques, many scholars have paid 
attention to their computational efficiency and high accuracy. Since soft computing techniques 
have been successfully used in different disciplines of civil engineering [14-21], researchers 
have attempted to apply them to establish the relationship between basic soil properties and the 
compression index [22,23]. Kurnaz et al. developed an artificial neural network (ANN) model to 
predict the compression and recompression indices. The model was built on a dataset of 246 
laboratory oedometer tests, and its inputs (soil properties) included the natural water content, 
liquid limit, plastic index, and specific gravity of soil particles [24]. Kordnaeij et al. 
proposed a group method of data handling (GMDH) type neural network to predict the 
recompression index. Their dataset, compiled from 344 consolidation tests for soils, included 
soil properties such as the liquid limit, initial void ratio, specific gravity, natural water 
content, plastic index, and dry density [25]. Nguyen et al. proposed a hybrid ANN model, the 
Biogeography-Based Optimization ANN, built from 188 soil samples. The input parameters 
included the depth of samples, clay content, moisture content, bulk density, dry density, 
specific gravity, void ratio, porosity, degree of saturation, liquid limit, plastic limit, 
plastic index, and liquid index. Principal component analysis (PCA) was used to reduce the 
dimension of the input parameters [26]. Benbouras et al. compared the performance of a 
multilayer neural network, genetic programming, and multiple regression in predicting the 
compression index. They used 373 oedometer test samples to develop the machine learning 
models. The best prediction model was established based on the input variables wet density, 
water content, liquid limit, plastic index, void ratio, and fines content [27]. 
Overall, the above-mentioned studies mainly focus on ANN or ANN-based models. To the best of 
the authors’ knowledge, no relevant study has discussed the application prospects of tree-based 
models in predicting the compression index of clayey soils. Tree-based models have several 
merits: they can handle a large number of features while maintaining good accuracy, and they 
are easy to interpret and explain because they are built on a set of decision trees [28]. We 
therefore hypothesize that tree-based models could perform well on this task. On this basis, we 
develop two tree-based models for predicting the compression index: random forest (RF) and 
extreme gradient boosting (XGBoost). First, we collected a dataset of clayey soils from a 
published article (Ref. [29]) to establish these two tree-based models. We then used the grid 
search algorithm to seek the optimal hyperparameters of the models. By comparing their 
performance using several evaluation metrics, we finally determined the best model for 
predicting the compression index of clayey soils. Our main contribution is to verify the 
promising application of tree-based models in predicting the compression index. 


The rest of the paper is organized as follows: Section 2 presents the background of the data 
source; Section 3 describes the principle and implementation of the tree-based models; Section 4 
discusses the main results of modeling; Section 5 summarizes the main conclusions. 


2. Materials 


In the present study, we collected a dataset that includes 391 experimental samples from a 
previously published article. The dataset is composed of the experimental results of consolidation 
tests (ASTM D 2435-96) for soils that were sampled at 125 construction sites in the north of 
Iran [29]. It mainly contains the physical properties of clayey soils: natural water content 
(ωn), liquid limit (LL), plastic index (PI), initial void ratio (e0), specific gravity of soil 
particles (Gs), and compression index (Cc). Our goal is to build an effective relationship between 
the compression index and the other five physical properties of clayey soils, with the help of 
tree-based machine learning models. 


Before developing the tree-based models, we pre-processed the dataset. Since the experimental 
tests may be subject to human-induced error, outliers could exist in the dataset and harm the 
performance of the tree-based models. Thus, we used the boxplot method, a common statistical 
technique, to detect outliers in the dataset [30]. A boxplot visualizes the five-number 
summary: the lower extreme (Min), the upper extreme (Max), the first quartile (Q1), the third 
quartile (Q3), and the median. Figure 1 shows the data distribution of the physical properties 
of clayey soils. The box extends from Q1 to Q3 of the data; the red line and rhombus point 
represent the median and mean values, respectively; and the black circle points denote the 
outliers of each variable [31]. Outliers exist in each variable and were removed. After 
removing the outliers, 349 data samples were available to develop the tree-based models. Table 
1 presents the statistical indices of each variable in the new dataset. The natural water 
content ranges from 12.7% to 42.1%; the liquid limit from 24% to 62%; the plastic index from 
4% to 37%; the initial void ratio from 0.476 to 1.059; the specific gravity of soil particles 
from 2.5 to 2.77; and the compression index from 0.05 to 0.385. 
Fig. 1. Boxplot of the collected Cc dataset. 
Table 1 
Statistical indices of the cleaned Cc dataset. 
Variables                           Symbol  Unit  Min.   Max.   Mean    Std. Dev. 
Natural water content               ωn      %     12.7   42.1   27.418  5.427 
Liquid limit                        LL      %     24     62     38.883  8.675 
Plastic index                       PI      %     4      37     17.848  7.633 
Initial void ratio                  e0      -     0.476  1.059  0.739   0.115 
Specific gravity of soil particles  Gs      -     2.5    2.77   2.64    0.054 
Compression index                   Cc      -     0.05   0.385  0.194   0.059 
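
The boxplot-based outlier screening described above can be sketched as follows. This is a minimal sketch under two assumptions that the text does not state: the whiskers use the conventional 1.5 × IQR rule, and a sample is discarded when any one of its variables falls outside the whiskers. The function names and the small dataset are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Boxplot whisker limits: (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(rows, columns):
    """Keep only rows that lie within the whisker limits of every column."""
    bounds = {c: iqr_bounds([r[c] for r in rows]) for c in columns}
    return [r for r in rows
            if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in columns)]

# Hypothetical mini-dataset (not the study's data): the last row has an
# implausibly high initial void ratio.
rows = [
    {"e0": 0.62, "Cc": 0.15}, {"e0": 0.70, "Cc": 0.18},
    {"e0": 0.74, "Cc": 0.19}, {"e0": 0.78, "Cc": 0.21},
    {"e0": 0.81, "Cc": 0.22}, {"e0": 1.90, "Cc": 0.24},
]
cleaned = remove_outliers(rows, ["e0", "Cc"])
print(len(rows), "->", len(cleaned))
```

Filtering on all columns at once, rather than one variable after another, keeps the whisker limits fixed while rows are dropped.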


A common modeling practice is to divide the entire dataset into two parts: the training 
dataset and the testing dataset. This makes it possible to examine the model’s generalization 
ability and helps to detect overfitting. Thus, we randomly split the 349 data samples into two 
sets: the training dataset (90% of the entire data), which has 314 samples, and the testing 
dataset (10% of the entire data), which has 35 samples. This splitting strategy is consistent 
with the published article [29], which allows us to verify whether our developed models 
perform better than the model in that article. We use the training dataset to establish the 
tree-based models for predicting the compression index and then use the testing dataset to 
examine the models’ generalization ability. For the random division to be valid, the training 
and testing datasets should have similar statistical properties. Herein, we used the 
cumulative distribution function to judge whether the training dataset has acceptable 
statistical similarity with the testing dataset. 


A cumulative distribution function describes the probability distribution of a continuous 
random variable; it is a non-decreasing function that rises from 0 to 1 as the value of the 
random variable increases from negative infinity to positive infinity [32]. Figure 2 
illustrates the cumulative distribution of the variables in the training and testing datasets. 
The variables ωn, e0, Gs, and Cc have similar tendencies and shapes in both datasets. The 
variables LL and PI differ slightly: the curve of the testing dataset lies below that of the 
training dataset. This is probably a sampling effect, since the testing dataset contains far 
fewer instances (35) than the training dataset (314). Additionally, the range of each variable 
in the training dataset almost covers that in the testing dataset, as the x-axis of each 
subplot shows. This suggests that the models fitted on the training dataset should rarely need 
to extrapolate and can be expected to perform well on the testing dataset. From the above 
analysis, we believe that the division into training and testing datasets is reasonable and 
can be used for modeling. 
Fig. 2. Cumulative distribution of variables in training and testing datasets. 
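
The random 90/10 split and the comparison of cumulative distributions can be sketched as below. This is an illustration only: the seed, the helper names, and the synthetic void ratios are assumptions, and the maximum gap between the two empirical distribution functions (the two-sample Kolmogorov-Smirnov statistic) is our own choice of summary, whereas the comparison in this study was made visually from Figure 2.

```python
import random
from bisect import bisect_right

def split_dataset(n_samples, test_fraction=0.10, seed=0):
    """Shuffle the sample indices and return (train_idx, test_idx)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def ecdf(sample):
    """Return F(x) = fraction of sample values less than or equal to x."""
    ordered = sorted(sample)
    return lambda x: bisect_right(ordered, x) / len(ordered)

def max_cdf_gap(sample_a, sample_b):
    """Largest vertical distance between the two empirical CDFs."""
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    return max(abs(f_a(x) - f_b(x)) for x in set(sample_a) | set(sample_b))

train_idx, test_idx = split_dataset(349)
print(len(train_idx), len(test_idx))

# Synthetic void ratios drawn within the range of Table 1 (illustrative only)
rng = random.Random(1)
e0 = [rng.uniform(0.476, 1.059) for _ in range(349)]
gap = max_cdf_gap([e0[i] for i in train_idx], [e0[i] for i in test_idx])
print(round(gap, 3))
```

A small gap indicates that the two subsets follow similar distributions for that variable; with only 35 testing samples, some gap is expected from sampling alone.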


Furthermore, Figure 3 presents the linear relationship between the compression index and each 
variable, computed on the training dataset only. The initial void ratio (e0) has a relatively 
strong relationship with the compression index, followed by the natural water content (ωn). 
The liquid limit (LL), plastic index (PI), and specific gravity of soil particles (Gs) all 
show an insignificant relationship with the compression index. Because no single variable 
explains the compression index linearly, we believe that a more sophisticated model should be 
constructed to characterize the intrinsic relationship between these variables and the 
compression index. The subsequent sections discuss this in detail. 
Fig. 3. Relationship between compression index and soil properties. 
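
One way to quantify the strength of such a linear relationship is the Pearson correlation coefficient. Whether Figure 3 reports this coefficient is not stated, so the sketch below is an assumption on our part, and the six data pairs are hypothetical rather than taken from the training dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical pairs of initial void ratio and compression index
e0 = [0.55, 0.62, 0.70, 0.78, 0.85, 0.95]
cc = [0.11, 0.14, 0.17, 0.21, 0.23, 0.29]
r = pearson_r(e0, cc)
print(round(r, 3))
```

Values of r close to 1 or -1 indicate a strong linear relationship, while values close to 0 indicate a weak one.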


3. Methods 


3.1. Random forest (RF) 


RF is a supervised machine learning algorithm that is used for both classification and regression 
tasks [33]. It is an ensemble model composed of multiple decision trees, which are trained on 
different samples of the training data and then aggregated to make a final prediction. The key 
idea behind RF is to create a diverse set of decision trees, each of which is trained on a 
randomly selected subset of the training data and a randomly selected subset of the features. 
Drawing the training subsets at random with replacement is known as bootstrapping; together 
with the random feature selection, it helps to reduce the variance of the model and make it 
more robust. During training, each decision tree in the RF learns to make predictions from the 
features in its own training set. The final prediction of the RF is then made by aggregating 
the predictions of all the individual decision trees, for example, by taking the average for 
regression tasks, as shown below: 


$$\hat{y} = \frac{1}{K} \sum_{i=1}^{K} T_i(x) \tag{1}$$


where $\hat{y}$ represents the average of the prediction results, $K$ is the number of decision 
trees, and $T_i(x)$ represents the predicted result of a single decision tree. 
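
To make the averaging in Equation (1) concrete, the following sketch bags a set of one-split regression trees (stumps) on a single feature. It is a simplified stand-in rather than the RF implementation used in this study: the stumps replace full decision trees, the random feature selection is omitted because only one feature is present, and all names and data values are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """One-split regression tree on a single feature, chosen by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_l, mean_r)
    if best is None:  # bootstrap sample with a single distinct x value
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, threshold, mean_l, mean_r = best
    return lambda x: mean_l if x <= threshold else mean_r

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_forest(trees, x):
    """Equation (1): average of the K individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical training pairs (initial void ratio, compression index)
e0 = [0.55, 0.60, 0.72, 0.80, 0.91, 0.98]
cc = [0.11, 0.12, 0.18, 0.20, 0.27, 0.29]
trees = fit_forest(e0, cc)
print(round(predict_forest(trees, 0.60), 3), round(predict_forest(trees, 0.95), 3))
```

Because each stump sees a different bootstrap sample, the split thresholds differ from tree to tree, and the average varies more smoothly with the input than any single stump does.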


One of the main benefits of a random forest model is that it can handle large amounts of data 
and a high number of features, and it is relatively resistant to overfitting. Additionally, it 
is relatively easy to interpret and understand, as the individual decision trees are simple 
models that can be inspected and analyzed. 


RF has two key hyperparameters that can strongly affect its performance: the number of trees 
and the maximum depth [34]. The number of trees denotes how many decision trees are in the 
forest and largely controls the prediction accuracy of the RF model. The maximum depth limits 
how deep each tree can grow, which reduces the RF model’s complexity and helps to avoid 
possible overfitting. In the present study, we aim to seek the optimal values of these two 
hyperparameters and thus construct a high-performance RF model for predicting the compression 
index. 


3.2. Extreme gradient boosting (XGBoost) 


Extreme gradient boosting (XGBoost) is a supervised machine learning algorithm that is used for 
both classification and regression tasks [35]. It is an ensemble model composed of multiple 
decision trees, which are trained sequentially so that the model learns from and improves on 
the mistakes made by earlier trees. 


XGBoost is a variant of the gradient boosting algorithm, a type of boosting algorithm based on 
the concept of combining weak learners to form a strong learner. Boosting algorithms work by 
iteratively adding weak models to the ensemble, each one fitted so that the mistakes made by 
the previous models are emphasized and corrected in the subsequent models [36]. In the XGBoost 
model, the decision trees are trained using gradient descent to minimize an objective 
function, which combines a loss measuring the difference between the predicted values and the 
true values in the training data with a regularization term. The objective function of the 
XGBoost model is shown below: 


$$X_{obj} = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$


where $X_{obj}$ represents the objective function, $\sum_{i=1}^{N} l(y_i, \hat{y}_i)$ 
represents the predictive loss between the predicted and real values, and 
$\sum_{k=1}^{K} \Omega(f_k)$ represents the regularization term that is used to avoid 
overfitting. In practice, the objective function is optimized by minimizing its second-order 
(quadratic) approximation at each boosting step [37]. 
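
The quadratic minimization mentioned above can be written out as follows. This is a sketch of the standard XGBoost derivation rather than a formula given in this paper: the step index $t$, the gradient statistics $g_i$ and $h_i$, and the specific form of the regularizer $\Omega$ are assumptions taken from that standard formulation.

```latex
% Second-order approximation of the objective at boosting step t, where f_t is
% the tree being added and \hat{y}_i^{(t-1)} is the current prediction.
X_{obj}^{(t)} \approx \sum_{i=1}^{N} \left[ g_i f_t(x_i)
    + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t) + \text{const},
\qquad
g_i = \frac{\partial\, l(y_i, \hat{y}_i^{(t-1)})}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial\, (\hat{y}_i^{(t-1)})^2}

% Assumed regularizer: \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
% for a tree with T leaves and leaf weights w_j. Minimizing the quadratic leaf
% by leaf over the sample set I_j of each leaf gives the optimal weight
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i

% Example: for the squared loss l = \tfrac{1}{2} (y_i - \hat{y}_i)^2 we have
% g_i = \hat{y}_i^{(t-1)} - y_i and h_i = 1, so with \lambda = 0 the optimal
% leaf weight is the mean residual of the samples in that leaf.
```

The squared-loss example shows why each new tree can be read as a fit to the residuals of the current ensemble.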


When constructing the XGBoost model, two key hyperparameters should be considered: the number 
of trees and the learning rate. The number of trees refers to the maximum number of 
gradient-boosted trees and controls the predictive accuracy of the XGBoost model. In general, 
the model tends to underfit if this value is too low and to overfit if it is too high. The 
learning rate refers to the step size shrinkage applied in each iteration, which makes the 
boosting process more conservative. In the present study, we aim to seek the optimal values of 
these two hyperparameters and thus construct a high-performance XGBoost model for predicting 
the compression index. 
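
The roles of the number of trees and the learning rate can be illustrated with a plain gradient boosting loop. The sketch below is not XGBoost itself: it uses the squared loss, one-split stumps on a single feature, and no regularization term, and all names and data values are hypothetical. The learning rate of 0.14 and the maximum of 60 trees mirror the values reported in Section 4.1.

```python
def fit_stump(xs, targets):
    """One-split regression tree on a single feature, chosen by squared error.
    Assumes xs contains at least two distinct values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - mean_l) ** 2 for t in left)
               + sum((t - mean_r) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_l, mean_r)
    _, threshold, mean_l, mean_r = best
    return lambda x: mean_l if x <= threshold else mean_r

def fit_boosting(xs, ys, n_trees, learning_rate):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # For the squared loss, the residual is the negative gradient
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Shrink each tree's contribution by the learning rate
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return base, trees, learning_rate

def predict_boosting(model, x):
    base, trees, learning_rate = model
    return base + learning_rate * sum(tree(x) for tree in trees)

def training_mse(model, xs, ys):
    return sum((y - predict_boosting(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(ys)

# Hypothetical training pairs (initial void ratio, compression index)
e0 = [0.55, 0.60, 0.72, 0.80, 0.91, 0.98]
cc = [0.11, 0.12, 0.18, 0.20, 0.27, 0.29]
for n_trees in (1, 10, 60):
    model = fit_boosting(e0, cc, n_trees=n_trees, learning_rate=0.14)
    print(n_trees, training_mse(model, e0, cc))
```

The training error keeps falling as trees are added, which is why the number of trees has to be selected on validation data rather than on the training error.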
3.3. Evaluation criteria 


To quantitatively assess the accuracy of the above-mentioned RF and XGBoost models, some 
commonly used regression evaluation metrics are utilized, namely the coefficient of 
determination (R²), root mean squared error (RMSE), mean absolute percentage error (MAPE), 
and mean absolute error (MAE). The following equations are used to compute these metrics. 


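$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \tag{3}$$


$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{4}$$


$$MAPE = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\% \tag{5}$$


$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{6}$$


where $y_i$ denotes the measured compression index, $\hat{y}_i$ denotes the predicted 
compression index, $\bar{y}$ denotes the average of $y_i$, and $N$ is the number of samples. 
For the above four metrics, the closer the R² is to 1, the better the model’s performance; the 
smaller the RMSE, MAPE, and MAE, the better the model’s performance. 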
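
Equations (3) to (6) translate directly into code. The sketch below is one possible implementation; the function name and the four measured and predicted values are hypothetical and serve only to show the calculation.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R2, RMSE, MAPE (in %), and MAE as defined in Eqs. (3)-(6)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAPE": 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# Hypothetical measured and predicted compression indices
measured = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.37]
metrics = regression_metrics(measured, predicted)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Note that MAPE divides by the measured value, so it is only meaningful when no measured compression index is zero, which holds for the dataset in Table 1.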


3.4. Study step 
The main steps of the research method in the present study are as follows. 


As mentioned previously, the entire dataset is divided into two parts: the training dataset 
involving 314 soil samples and the testing dataset involving 35 soil samples. We use the 
training dataset to establish the RF and XGBoost models, respectively. In this process, the grid 
search algorithm is employed to seek the optimal hyperparameters of the RF and XGBoost 
models [38]. The hyperparameters of the RF model are the number of trees and the maximum 
depth, and the hyperparameters of the XGBoost model are the number of trees and the learning 
rate. Five-fold cross-validation is used when training the RF and XGBoost models to help the 
models avoid overfitting. After determining the hyperparameters of the RF and XGBoost models, 
we use the testing dataset to examine their generalization ability. Lastly, we analyze which 
variable the predicted compression index of clayey soils is most sensitive to. Figure 4 
displays the flowchart of the present study. We used two open-source Python libraries, 
Scikit-learn [39] and XGBoost [35,36], to develop the RF and XGBoost models, respectively. 
Fig. 4. Flowchart of the study step. 
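
The combination of grid search and five-fold cross-validation can be sketched generically as below. To keep the example free of external dependencies, a k-nearest-neighbour average stands in for the RF and XGBoost models, so the hyperparameter `k`, the helper names, and the synthetic data are all assumptions for illustration; the study itself relied on Scikit-learn and XGBoost.

```python
import itertools
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, valid

def cross_val_mse(fit, params, xs, ys, n_folds=5):
    """Mean validation MSE of the model returned by fit(params, xs, ys)."""
    scores = []
    for train, valid in k_fold_indices(len(xs), n_folds):
        predict = fit(params, [xs[i] for i in train], [ys[i] for i in train])
        scores.append(sum((ys[i] - predict(xs[i])) ** 2 for i in valid) / len(valid))
    return sum(scores) / len(scores)

def grid_search(fit, grid, xs, ys, n_folds=5):
    """Evaluate every combination in the grid; return (best_params, best_mse)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        mse = cross_val_mse(fit, params, xs, ys, n_folds)
        if best is None or mse < best[1]:
            best = (params, mse)
    return best

def fit_knn(params, xs, ys):
    """Stand-in model: mean target of the k nearest training points."""
    k = params["k"]
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

# Synthetic data, illustrative only: a noisy linear trend in the void ratio
rng = random.Random(42)
e0 = [rng.uniform(0.476, 1.059) for _ in range(60)]
cc = [0.4 * e - 0.1 + rng.gauss(0, 0.02) for e in e0]
best_params, best_mse = grid_search(fit_knn, {"k": [1, 3, 5, 9, 15]}, e0, cc)
print(best_params, best_mse)
```

Selecting hyperparameters by the cross-validated MSE, instead of the training MSE, is what guards the search against favouring overfitted settings.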
4. Results 


4.1. Evaluation of model performance 


In this section, we trained the RF and XGBoost models on the same training dataset and then 
determined their respective optimal hyperparameters. First, we defined the searching domain of 
these two models, as shown in Table 2. For the RF model, the number of trees ranges from 50 to 
300 in increments of 10, and the maximum depth ranges from 1 to 20 in increments of 1. For the 
XGBoost model, the number of trees ranges from 50 to 300 in increments of 10, and the learning 
rate ranges from 0.01 to 0.30 in increments of 0.01. We then used the mean squared error (MSE) 
as the evaluation metric to determine the optimal hyperparameters of each model. 
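
To give a sense of the size of these searching domains, the short sketch below enumerates both grids. The dictionary keys are hypothetical names, and the count of model fits assumes that five-fold cross-validation is run once for every grid point.

```python
rf_grid = {
    "n_trees": list(range(50, 301, 10)),   # 50, 60, ..., 300
    "max_depth": list(range(1, 21)),       # 1, 2, ..., 20
}
xgb_grid = {
    "n_trees": list(range(50, 301, 10)),
    "learning_rate": [round(0.01 * i, 2) for i in range(1, 31)],  # 0.01 ... 0.30
}

def n_combinations(grid):
    """Number of hyperparameter combinations in a full grid."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total

n_folds = 5
for name, grid in (("RF", rf_grid), ("XGBoost", xgb_grid)):
    combos = n_combinations(grid)
    print(name, combos, "combinations,", combos * n_folds, "fits")
```

Under these assumptions the RF grid contains 26 × 20 = 520 combinations and the XGBoost grid 26 × 30 = 780, i.e., 2600 and 3900 model fits, respectively.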


Figure 5 shows the MSE obtained for each hyperparameter combination of the RF model. The MSE 
is relatively large when the maximum depth is less than 7. When the maximum depth is larger 
than 7, the MSE does not fluctuate strongly. Moreover, the maximum depth influences the MSE 
more than the number of trees does, because the MSE decreases significantly as the maximum 
depth increases. According to Figure 5 (b), the optimal hyperparameters of the RF model are a 
number of trees of 130 and a maximum depth of 10. 


Figure 6 shows the MSE obtained for each hyperparameter combination of the XGBoost model. The 
MSE is relatively large only when both the number of trees and the learning rate are small. In 
other cases, the MSE does not change noticeably. According to Figure 6 (b), the optimal 
hyperparameters of the XGBoost model are a number of trees of 60 and a learning rate of 0.14. 


Table 2 
Hyperparameters of the RF and XGBoost models. 
Model    Hyperparameter   Searching domain  Increment 
RF       Number of trees  [50, 300]         10 
         Maximum depth    [1, 20]           1 
XGBoost  Number of trees  [50, 300]         10 
         Learning rate    [0.01, 0.30]      0.01 
Fig. 5. Determination of the hyperparameters of the RF model. 
Fig. 6. Determination of the hyperparameters of the XGBoost model. 
After determining the values of the hyperparameters of each model, we used them to construct 
the RF and XGBoost models, respectively. Subsequently, we examined their performance on both 
the training and testing datasets and compared them with the ANN model in the published 
article [29]. Table 3 shows the performance of the models on the training and testing 
datasets. The RF model has the lowest errors (RMSE, MAPE, and MAE) on both the training and 
testing datasets compared with the XGBoost and ANN models. Its performance indices are as 
follows: R² of 0.928, RMSE of 0.016, MAPE of 7.046%, and MAE of 0.012 on the training dataset; 
R² of 0.818, RMSE of 0.025, MAPE of 10.082%, and MAE of 0.020 on the testing dataset. We note 
that the XGBoost model attains a slightly higher R² on the testing dataset (0.833 versus 
0.818), although its error metrics are higher. Additionally, both the RF and XGBoost models 
outperform the ANN model in terms of RMSE, MAPE, and MAE. Thus, we conclude that tree-based 
models are promising for predicting the compression index of clayey soils. 


Figure 7 compares the experimental compression index with the compression index predicted by 
the RF model. For the training dataset, almost all the data points are concentrated around the 
black dashed line. This indicates that the compression index predicted by the RF model 
approximates the experimental compression index. For the testing dataset, most of the data 
points are concentrated around the black dashed line, but several are not. This indicates that 
although the generalization ability of the current RF model is acceptable, it still needs 
further improvement. Overall, the developed RF model shows acceptable and effective 
performance on both the training and testing datasets. 


Table 3 
Comparison of models’ performance. 
            Training dataset                  Testing dataset 
Model       R²     RMSE   MAPE(%)  MAE        R²     RMSE   MAPE(%)  MAE 
RF          0.928  0.016  7.046    0.012      0.818  0.025  10.082   0.020 
XGB         0.832  0.024  10.933   0.019      0.833  0.026  11.125   0.021 
ANN [29]    -      0.035  13.340   0.027      -      0.034  13.170   0.027 
Fig. 7. Predicted and experimental compression index. 
4.2. Sensitivity analysis 


From the above analysis, we obtained the best tree-based model for predicting the compression 
index, namely, the RF model. In this section, we determine which variable has the highest 
influence on the predicted compression index when using the RF model. The RF model has an 
intrinsic attribute, feature importance, which measures the contribution of each feature to 
the split nodes of the decision trees. For a regression task, the criterion for constructing a 
split node is the “squared error”: each split is chosen to minimize the L2 loss, using the 
mean of each child node as its prediction [40]. In short, the more a feature’s splits reduce 
the L2 loss, the higher its importance. Based on this, we can obtain the importance of each 
feature (variable), as shown in Figure 8. For the present engineering instance, the variable 
e0, i.e., the initial void ratio, shows the highest impact on predicting the compression 
index; the variables Gs and ωn, i.e., the specific gravity of soil particles and the natural 
water content, show a relatively slight impact; and the variables PI and LL, i.e., the plastic 
index and the liquid limit, show a negligible impact. As a result, we conclude that the 
initial void ratio should be a significant concern in predicting the compression index. 
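
The idea behind this importance measure can be illustrated on a single split node. The sketch below is a simplification under stated assumptions: it scores each feature by the largest squared-error reduction that one split on it achieves at one node, whereas a fitted forest accumulates such reductions over all nodes and trees before normalizing. The helper names and data values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (L2 loss of a node)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def squared_error_reduction(feature, target, threshold):
    """Decrease in L2 loss when a node is split at feature <= threshold."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    if not left or not right:
        return 0.0
    return sse(target) - sse(left) - sse(right)

def best_reduction(feature, target):
    """Largest reduction achievable by any single split on this feature."""
    thresholds = sorted(set(feature))[:-1]
    return max((squared_error_reduction(feature, target, t) for t in thresholds),
               default=0.0)

# Hypothetical samples reaching one node
e0 = [0.55, 0.60, 0.72, 0.80, 0.91, 0.98]
gs = [2.70, 2.55, 2.66, 2.60, 2.68, 2.58]
cc = [0.11, 0.12, 0.18, 0.20, 0.27, 0.29]

gains = {"e0": best_reduction(e0, cc), "Gs": best_reduction(gs, cc)}
total = sum(gains.values())
importance = {name: gain / total for name, gain in gains.items()}
print(importance)
```

In this toy node the compression index rises steadily with e0 but not with Gs, so a split on e0 removes far more of the squared error and receives the larger share of the importance.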


Fig. 8. Importance of each variable in predicting compression index (horizontal axis: importance). 


Further, to identify the specific effect of the initial void ratio (e0) on the compression 
index, we used Partial Dependence Plots and Individual Conditional Expectation plots to 
visualize and analyze the interaction between the initial void ratio and the compression 
index. In general, Partial Dependence Plots show the average (overall) dependence between the 
target response and the input feature of interest [41]. Individual Conditional Expectation 
plots reflect the individual dependence between the target response and the input feature of 
interest for each selected data sample [42]. Figure 9 shows the particular effect of the 
initial void ratio on the compression index. The red dashed line represents the average 
dependence between the initial void ratio and the compression index. The compression index 
increases with the initial void ratio, especially when the initial void ratio is between 0.58 
and 0.90. However, when the initial void ratio is between 0.476 and 0.58 or between 0.90 and 
1.059, the compression index is almost unchanged. As for the individual dependence between the 
initial void ratio and the compression index (all blue lines), most of the data samples follow 
a trend similar to the red dashed line, although a few of them fluctuate. 
In summary, the relationship between the initial void ratio and the compression index is 
approximately linear and positive over the middle of the range, which is helpful for 
determining the compression index. Some published studies also pointed out that the 
compression index of clayey soils is highly dependent on the initial void ratio. For instance, 
Tiwari and Ajmera reported a significant linear relationship between the compression index and 
the initial void ratio [43]. Akbarimehe et al. also concluded a valid linear correlation 
between the compression index and the initial void ratio through consolidation tests [44]. 
Erzin et al. developed an empirical formula based on a robust optimization model and found 
that the compression index is more sensitive to the initial void ratio [45]. 
Fig. 9. Dependence between the initial void ratio and the compression index. 
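
The two kinds of curves in Figure 9 can be computed with a few lines of code for any fitted model. In the sketch below, `toy_model` is a hypothetical linear stand-in for the fitted RF model, and the sample rows, feature names, and grid values are likewise assumptions made for illustration.

```python
def ice_curves(predict, rows, feature, grid):
    """One curve per sample: predictions while `feature` is swept over `grid`
    and all other inputs of that sample are held fixed."""
    return [[predict({**row, feature: value}) for value in grid] for row in rows]

def partial_dependence(predict, rows, feature, grid):
    """Average of the individual (ICE) curves at each grid value."""
    curves = ice_curves(predict, rows, feature, grid)
    return [sum(curve[j] for curve in curves) / len(curves)
            for j in range(len(grid))]

def toy_model(row):
    """Hypothetical stand-in for a fitted model's prediction function."""
    return 0.4 * row["e0"] + 0.002 * row["wn"] - 0.15

# Hypothetical samples (initial void ratio, natural water content in %)
rows = [{"e0": 0.60, "wn": 20.0}, {"e0": 0.75, "wn": 28.0}, {"e0": 0.95, "wn": 35.0}]
grid = [0.5, 0.7, 0.9]

ice = ice_curves(toy_model, rows, "e0", grid)
pdp = partial_dependence(toy_model, rows, "e0", grid)
print([round(v, 4) for v in pdp])
```

For this additive toy model, every individual curve is parallel to the average curve; departures from parallel curves in a real model indicate that the effect of the initial void ratio depends on the other inputs.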
5. Conclusion 


In the present study, we proposed two tree-based models (RF and XGBoost) to predict the 
compression index of clayey soils. First, a dataset of soil consolidation tests, collected 
from a previously published work, was utilized to develop the tree-based models. In the 
tree-based models, the input parameters included natural water content (ωn), liquid limit 
(LL), plastic index (PI), initial void ratio (e0), and specific gravity of soil particles 
(Gs), whereas the compression index is the target output. Then, we used a grid search 
algorithm to seek the optimal hyperparameters of the tree-based models. The optimal 
hyperparameters of the RF model are: number of trees = 130, maximum depth = 10; and the 
optimal hyperparameters of the XGBoost model are: number of trees = 60, learning rate = 0.14. 
By comparing their performance on both the training and testing datasets, we found that the RF 
model outperformed the XGBoost model. The RF model obtained the lower errors in predicting the 
compression index of clayey soils, as evidenced by R² of 0.928 and 0.818, RMSE of 0.016 and 
0.025, MAPE of 7.046% and 10.082%, and MAE of 0.012 and 0.020 on the training and testing 
datasets, respectively. This suggests that the RF model can help to reduce the cost of the 
laboratory experiments needed to determine the compression index of clayey soils. 
Furthermore, according to the feature importance of the input parameters in the RF model, we 
found that the initial void ratio (e0) has a significant impact on predicting the compression 
index in the present engineering instance. This helps engineers to understand the compression 
characteristics of clayey soils: we emphasize an approximately linear, positive relationship 
between the initial void ratio (e0) and the compression index of clayey soils, and we 
recommend that engineers focus on this point when dealing with similar scenarios. 